BIOM: the Biological Observation Matrix, a general-purpose data format for microbiome data
Introduction
http://biom-format.org/
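BIOM is the most widely used format for storing results in the microbiome field. Its advantage is that several tables (the OTU or feature table, sample attributes, taxonomy, and so on) can be kept in a single file with a uniform layout and a smaller size. It is currently supported by almost all mainstream microbiome software: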
QIIME
MG-RAST
PICRUSt
Mothur
phyloseq
MEGAN
VAMPS
metagenomeSeq
Phinch
RDP Classifier
USEARCH
PhyloToAST
EBI Metagenomics
GCModeller
MetaPhlAn 2
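The BIOM format was first published in 2012 by Daniel McDonald, Rob Knight, J. Gregory Caporaso, and colleagues in GigaScience, a journal co-published by BGI in China; the paper had been cited 242 times at the time of writing (see Reference).
The Biological Observation Matrix (or BIOM, canonically pronounced biome) is a core data type for microbiome analysis.
We will mainly cover three things:
the definition of the BIOM file format;
the biom command for converting between file formats, adding metadata, summarizing tables, and so on;
working with BIOM files from Python and R.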
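Installing the biom tool
The usual tool for working with BIOM files is a Python package, which can be installed with pip, conda, etc.
# Install the scientific-computing dependency
pip install numpy
# Install the biom package
pip install biom-format
# Install support for the BIOM 2.0 (HDF5) format
pip install h5py
# Show the command-line help
biom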
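The preferred route is conda, which can install the corresponding Python and R packages.
Names and version details of the corresponding bioconda packages can be looked up at https://bioconda.github.io/recipes.html
# Install the Python package
conda install biom-format # 2.1.7 at the time of writing
# Install the R biom package
conda install r-biom
# Or install the R microbiome package, which includes r-biom
conda install bioconductor-microbiome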
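After installation it can be useful to confirm from Python that the packages are importable. The following is a minimal sketch using only the standard library; the helper name check_packages is hypothetical, and biom and h5py are assumed to be the import names of the biom-format and h5py packages installed above.

```python
import importlib.util


def check_packages(names):
    """Return {module_name: bool}, True when the top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Report which of the packages from the install steps are available.
    for name, found in check_packages(["numpy", "biom", "h5py"]).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

If h5py is reported as missing, HDF5 (BIOM 2.x) files will likely not be readable even though the biom command itself runs.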
The main functions are as follows:
Usage: biom [OPTIONS] COMMAND [ARGS]...
Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.
Commands:
  add-metadata       Add metadata to a BIOM table.
  convert            Convert to/from the BIOM table format.
  from-uc            Create a BIOM table from a vsearch/uclust/usearch BIOM...
  head               Dump the first bit of a table.
  normalize-table    Normalize a BIOM table.
  show-install-info  Provide information about the biom-format installation.
  subset-table       Subset a BIOM table.
  summarize-table    Summarize sample or observation data in a BIOM table.
  table-ids          Dump IDs in a table.
  validate-table     Validate a BIOM-formatted file.
File format
http://biom-format.org/documentation/biom_format.html
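BIOM currently comes in two versions: 1.0 (JSON) and 2.0 (HDF5).
BIOM 1.0 (JSON) is a format widely supported across programming languages, with a structure similar to hash-style key-value pairs. Depending on how sparse the data are, it uses a different storage structure (dense or sparse) to save space.
BIOM 2.0 (HDF5) is a binary format supported by many programming languages; it is faster to read and more space-efficient.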
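To make the key-value structure of BIOM 1.0 concrete, here is a minimal sketch that builds a JSON document by hand with the standard library. The field names follow my reading of the BIOM 1.0 specification linked above and should be checked against it; the generated_by and date values are placeholders, and the sample and OTU names are purely illustrative.

```python
import json

# Sparse entries as [row_index, column_index, value]; zero cells are omitted.
sparse_data = [[0, 2, 4], [1, 0, 6], [2, 0, 1], [2, 2, 7], [3, 2, 3]]

table = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "hand-written example",  # placeholder
    "date": "2012-01-01T00:00:00",            # placeholder timestamp
    "rows": [{"id": f"OTU{i}", "metadata": None} for i in range(4)],
    "columns": [{"id": s, "metadata": None}
                for s in ("PC.354", "PC.355", "PC.356")],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [4, 3],  # 4 observations (rows) x 3 samples (columns)
    "data": sparse_data,
}

text = json.dumps(table, indent=2)
print(text)
```

In practice such files are produced by biom convert rather than written by hand; the sketch only illustrates what the JSON contains.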
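Tips and FAQs
BIOM is designed to store and process large, sparse tables; to keep a study's core information in a single file; and to be a common format across different software.
Below are the two styles commonly used to store an OTU table.
A dense representation of an OTU table: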
OTU ID PC.354 PC.355 PC.356
OTU0 0 0 4
OTU1 6 0 0
OTU2 1 0 7
OTU3 0 0 3
A sparse representation of an OTU table:
PC.354 OTU1 6
PC.354 OTU2 1
PC.356 OTU0 4
PC.356 OTU2 7
PC.356 OTU3 3
OTU tables often contain 90% zeros, sometimes even 99%. BIOM 1.0 supports both the sparse and the dense representation; BIOM 2.x supports only the sparse one.
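The relationship between the two representations can be sketched in a few lines of plain Python. This is an illustration of the idea only, not how the biom package stores data internally; the function names are hypothetical, and the matrix is the small example table above (rows OTU0 to OTU3, columns PC.354 to PC.356).

```python
def dense_to_sparse(dense):
    """Convert a list-of-rows matrix to (row, col, value) triples, skipping zeros."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0]


def sparse_to_dense(triples, n_rows, n_cols):
    """Rebuild the full matrix, filling every unlisted cell with zero."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in triples:
        dense[r][c] = v
    return dense


def density(dense):
    """Fraction of cells holding a non-zero value."""
    total = sum(len(row) for row in dense)
    nonzero = sum(1 for row in dense for v in row if v != 0)
    return nonzero / total


dense = [[0, 0, 4],
         [6, 0, 0],
         [1, 0, 7],
         [0, 0, 3]]
triples = dense_to_sparse(dense)
print(triples)
print(density(dense))
```

Only 5 of the 12 cells are stored in the sparse form; the saving grows as the share of zeros approaches the 90% to 99% typical of real OTU tables.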
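It packs the core study data (OTU table, sample information, and OTU taxonomy) into a single file.
Quick Start
The official Quick Start covers working with BIOM files interactively in Python. I rarely use it, so the details are in Appendix 1.
File format conversion
The convert command converts freely between plain-text tables and the BIOM format. It can:
convert to a tab-delimited table, which is easy to inspect in Excel and similar programs;
convert between the sparse and dense BIOM representations (the dense representation exists only in BIOM 1.0).
Tab-delimited tables are usually called classic-format tables, and BIOM-format files are called biom tables.
Convert a classic table to HDF5 or JSON format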
biom convert -i table.txt -o table.from_txt_json.biom --table-type="OTU table" --to-json
biom convert -i table.txt -o table.from_txt_hdf5.biom --table-type="OTU table" --to-hdf5
Convert a biom table to the classic format
biom convert -i table.biom -o table.from_biom.txt --to-tsv
Convert a biom table to the classic format, with the taxonomy annotation in the last column
biom convert -i table.biom -o table.from_biom_w_taxonomy.txt --to-tsv --header-key taxonomy
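Convert a biom table to the classic format, with the taxonomy annotation in the last column, renamed to ConsensusLineage
This is useful when downstream software requires a specific column name.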
biom convert -i table.biom -o table.from_biom_w_consensuslineage.txt --to-tsv --header-key taxonomy --output-metadata-id "ConsensusLineage"
Round-trip conversion of tables with taxonomy annotation
biom convert -i table.biom -o table_tax.txt --to-tsv --header-key taxonomy
biom convert -i table_tax.txt -o new_table.biom --to-hdf5 --table-type="OTU table" --process-obs-metadata taxonomy
biom convert -i table_tax.txt -o new_table.biom --to-json --table-type="OTU table" --process-obs-metadata taxonomy
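If you want to inspect the exported classic table without any third-party package, a small reader along these lines may help. It is a sketch under stated assumptions: the file is tab-separated, the header line starts with "#OTU ID", other "#" lines are comments, and an optional last column named taxonomy holds semicolon-separated levels. The function name read_classic_table and the taxonomy strings in the demo text are hypothetical.

```python
import csv
import io


def read_classic_table(handle):
    """Parse a classic OTU table into (sample_ids, counts, taxonomy)."""
    sample_ids, counts, taxonomy = [], {}, {}
    has_tax = False
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        if fields[0].startswith("#OTU ID"):
            # Header row: sample IDs, optionally followed by a taxonomy column.
            has_tax = fields[-1] == "taxonomy"
            sample_ids = fields[1:-1] if has_tax else fields[1:]
            continue
        if fields[0].startswith("#"):
            continue  # comment line such as "# Constructed from biom file"
        values = fields[1:-1] if has_tax else fields[1:]
        counts[fields[0]] = [float(v) for v in values]
        if has_tax:
            taxonomy[fields[0]] = [t.strip() for t in fields[-1].split(";")]
    return sample_ids, counts, taxonomy


demo = (
    "# Constructed from biom file\n"
    "#OTU ID\tPC.354\tPC.355\tPC.356\ttaxonomy\n"
    "OTU0\t0.0\t0.0\t4.0\tk__Bacteria; p__Firmicutes\n"
    "OTU1\t6.0\t0.0\t0.0\tk__Bacteria; p__Proteobacteria\n"
)
sample_ids, counts, taxonomy = read_classic_table(io.StringIO(demo))
print(sample_ids, counts, taxonomy)
```

For a real file, pass an open file handle (for example open("table_tax.txt", newline="")) instead of the io.StringIO demo.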
Convert an early QIIME 1.4 table to BIOM format (rarely needed)
sed 's/Consensus Lineage/ConsensusLineage/' < otu_table.txt | sed 's/ConsensusLineage/taxonomy/' > otu_table.taxonomy.txt
biom convert -i otu_table.taxonomy.txt -o otu_table.from_txt.biom --table-type="OTU table" --process-obs-metadata taxonomy --to-hdf5
Adding sample groups and taxonomy annotation to a biom file
biom add-metadata -h
# Show the help message
Usage: biom add-metadata [OPTIONS]
  Add metadata to a BIOM table.
  Add sample and/or observation metadata to BIOM-formatted files. See
  examples here: http://biom-format.org/documentation/adding_metadata.html
  Example usage:
  Add sample metadata to a BIOM table:
  $ biom add-metadata -i otu_table.biom -o table_with_sample_metadata.biom
  -m sample_metadata.txt
Options:
  -i, --input-fp FILE             The input BIOM table [required]
  -o, --output-fp FILE            The output BIOM table [required]
  -m, --sample-metadata-fp FILE   The sample metadata mapping file (will add
                                  sample metadata to the input BIOM table, if
                                  provided).
  --observation-metadata-fp FILE  The observation metadata mapping file (will
                                  add observation metadata to the input BIOM
                                  table, if provided).
  --sc-separated TEXT             Comma-separated list of the metadata fields
                                  to split on semicolons. This is useful for
                                  hierarchical data such as taxonomy or
                                  functional categories.
  --sc-pipe-separated TEXT        Comma-separated list of the metadata fields
                                  to split on semicolons and pipes ("|"). This
                                  is useful for hierarchical data such as
                                  functional categories with one-to-many
                                  mappings (e.g. x;y;z|x;y;w)).
  --int-fields TEXT               Comma-separated list of the metadata fields
                                  to cast to integers. This is useful for
                                  integer data such as "DaysSinceStart".
  --float-fields TEXT             Comma-separated list of the metadata fields
                                  to cast to floating point numbers. This is
                                  useful for real number data such as "pH".
  --sample-header TEXT            Comma-separated list of the sample metadata
                                  field names. This is useful if a header line
                                  is not provided with the metadata, if you
                                  want to rename the fields, or if you want to
                                  include only the first n fields where n is
                                  the number of entries provided here.
  --observation-header TEXT       Comma-separated list of the observation
                                  metadata field names. This is useful if a
                                  header line is not provided with the
                                  metadata, if you want to rename the fields,
                                  or if you want to include only the first n
                                  fields where n is the number of entries
                                  provided here.
  --output-as-json                Write the output file in JSON format.
  -h, --help                      Show this message and exit.
Your sample grouping (mapping) file should look like this:
head sample.txt
#SampleID BarcodeSequence genotype
KO1 TAGCTT KO
KO2 GGCTAC KO
KO3 CGCGCG KO
Your taxonomy annotation file should look like this:
head taxonomy.txt
#OTUID taxonomy confidence
OTU_325 k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;f__Cryomorphaceae;g__;s__ 0.880
OTU_324 k__Bacteria;p__Chlorobi;c__SJA-28;o__;f__;g__;s__ 1.000
Add sample grouping information
biom add-metadata -i table.biom -o table.w_smd.biom --sample-metadata-fp sample.txt
Add OTU annotation
biom add-metadata -i table.biom -o table.w_omd.biom --observation-metadata-fp taxonomy.txt
Add both sample and OTU annotation
biom add-metadata -i table.biom -o table.w_md.biom --observation-metadata-fp taxonomy.txt --sample-metadata-fp sample.txt
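The command above adds the row (observation) and column (sample) information in one step.
You can also specify how annotation columns are parsed: as integers (--int-fields), as floating-point numbers (--float-fields), or as hierarchical taxonomy split on semicolons (--sc-separated).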
biom add-metadata -i table.biom -o table.w_md.biom --observation-metadata-fp taxonomy.txt --sample-metadata-fp sample.txt --sc-separated taxonomy --float-fields confidence
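Conceptually, these options turn the raw text columns into typed values. The sketch below illustrates that idea on one line of the taxonomy.txt example above; it is not the biom implementation, and the function name parse_observation_metadata and its parameters are hypothetical.

```python
def parse_observation_metadata(line, header, sc_separated=(), float_fields=()):
    """Split one tab-separated metadata line into (observation_id, metadata dict)."""
    fields = line.rstrip("\n").split("\t")
    obs_id, values = fields[0], fields[1:]
    metadata = {}
    for name, value in zip(header[1:], values):
        if name in sc_separated:
            # Hierarchical field: one list element per level.
            metadata[name] = [part.strip() for part in value.split(";")]
        elif name in float_fields:
            metadata[name] = float(value)
        else:
            metadata[name] = value  # left as plain text
    return obs_id, metadata


header = ["#OTUID", "taxonomy", "confidence"]
line = "OTU_324\tk__Bacteria;p__Chlorobi;c__SJA-28;o__;f__;g__;s__\t1.000"
obs_id, md = parse_observation_metadata(
    line, header, sc_separated={"taxonomy"}, float_fields={"confidence"}
)
print(obs_id, md)
```

Without such casting, the taxonomy would stay a single string and the confidence a text value, which is harder for downstream tools to filter on.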
--observation-header and --sample-header can rename the columns:
biom add-metadata -i min_sparse_otu_table.biom -o table.w_smd.biom --sample-metadata-fp sam_md.txt --sample-header SampleID,BarcodeSequence,DateOfBirth
biom add-metadata -i min_sparse_otu_table.biom -o table.w_omd.biom --observation-metadata-fp obs_md.txt --observation-header OTUID,taxonomy,confidence
You can also read in only the named columns (the first n fields, where n is the number of names given):
biom add-metadata -i min_sparse_otu_table.biom -o table.w_omd.biom --observation-metadata-fp obs_md.txt --observation-header OTUID,taxonomy --sc-separated taxonomy
Summarizing a BIOM table
biom summarize-table -h
Summarize the counts for each sample
biom summarize-table -i table.w_md.biom -o table.w_md_summary.txt
Example output:
Num samples: 27
Num observations: 975
Total count: 409647
Table density (fraction of non-zero values): 0.464
Counts/sample summary:
Min: 2352.0
Max: 35955.0
Median: 14851.000
Mean: 15172.111
Std. dev.: 10691.823
Sample Metadata Categories: BarcodeSequence; genotype
Observation Metadata Categories: taxonomy; confidence
Counts/sample detail:
OE4: 2352.0
OE3: 2353.0
OE8: 3091.0
OE2: 3173.0
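To see where these numbers come from, the main fields of the summary can be recomputed by hand. The following is a rough sketch on the small dense example table from the file-format section, not a reimplementation of biom summarize-table; the standard deviation is left out because I have not checked which estimator the tool uses, and the function names are hypothetical.

```python
import statistics

samples = ["PC.354", "PC.355", "PC.356"]
dense = [[0, 0, 4],   # OTU0
         [6, 0, 0],   # OTU1
         [1, 0, 7],   # OTU2
         [0, 0, 3]]   # OTU3


def counts_per_sample(dense, samples):
    """Sum each column: total count observed in each sample."""
    return {s: sum(row[j] for row in dense) for j, s in enumerate(samples)}


def summarize(dense, samples):
    per_sample = counts_per_sample(dense, samples)
    values = list(per_sample.values())
    nonzero = sum(1 for row in dense for v in row if v != 0)
    return {
        "num_samples": len(samples),
        "num_observations": len(dense),
        "total_count": sum(values),
        "density": nonzero / (len(dense) * len(samples)),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        # Per-sample detail, sorted from the smallest count upward.
        "detail": sorted(per_sample.items(), key=lambda kv: kv[1]),
    }


summary = summarize(dense, samples)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The sample with the lowest count appears first in the detail list, which is handy when choosing a rarefaction depth.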
Count the unique observations per sample, i.e. richness (an alpha-diversity measure)
biom summarize-table -i table.w_md.biom --qualitative -o table.w_md_qual_summary.txt
The output looks like this:
Num samples: 27
Num observations: 975
Observations/sample summary:
Min: 222
Max: 633
Median: 486.000
Mean: 452.704
Std. dev.: 138.713
Sample Metadata Categories: BarcodeSequence; genotype
Observation Metadata Categories: taxonomy; confidence
Observations/sample detail:
OE3: 222
OE4: 248
OE8: 261
OE1: 272
OE2: 278
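The qualitative summary ignores abundances and only asks whether an observation is present in a sample. A minimal sketch of that calculation, using the same small example table as in the file-format section (the function name is hypothetical):

```python
samples = ["PC.354", "PC.355", "PC.356"]
dense = [[0, 0, 4],   # OTU0
         [6, 0, 0],   # OTU1
         [1, 0, 7],   # OTU2
         [0, 0, 3]]   # OTU3


def observations_per_sample(dense, samples):
    """Count the observations with a non-zero count in each sample (richness)."""
    return {s: sum(1 for row in dense if row[j] != 0)
            for j, s in enumerate(samples)}


richness = observations_per_sample(dense, samples)
print(richness)
```

Here PC.356 contains three of the four OTUs, while PC.355 contains none.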
Reference
The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome.
Daniel McDonald, Jose C. Clemente, Justin Kuczynski, Jai Ram Rideout, Jesse Stombaugh, Doug Wendel, Andreas Wilke, Susan Huse, John Hufnagle, Folker Meyer, Rob Knight, and J. Gregory Caporaso.
GigaScience 2012, 1:7. doi:10.1186/2047-217X-1-7
http://biom-format.org/
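Appendix 1. Functions for working with BIOM interactively in Python
Functions
With the biom package installed, BIOM files can be read in an interactive Python session.
The load_table(f) function reads a BIOM file.
Load and display the example table bundled with biom: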
>>> from biom import example_table
>>> print(example_table)
# Constructed from biom file
#OTU ID S1 S2 S3
O1 0.0 1.0 2.0
O2 3.0 4.0 5.0
Read a BIOM table from a file:
from biom import load_table
table = load_table('otutab.biom')
The Table class
Table(data, observation_ids, sample_ids[, …])
import numpy as np
from biom.table import Table
data = np.arange(40).reshape(10, 4)
sample_ids = ['S%d' % i for i in range(4)]
observ_ids = ['O%d' % i for i in range(10)]
sample_metadata = [{'environment': 'A'}, {'environment': 'B'},
{'environment': 'A'}, {'environment': 'B'}]
observ_metadata = [{'taxonomy': ['Bacteria', 'Firmicutes']},
{'taxonomy': ['Bacteria', 'Firmicutes']},
{'taxonomy': ['Bacteria', 'Proteobacteria']},
{'taxonomy': ['Bacteria', 'Proteobacteria']},
{'taxonomy': ['Bacteria', 'Proteobacteria']},
{'taxonomy': ['Bacteria', 'Bacteroidetes']},
{'taxonomy': ['Bacteria', 'Bacteroidetes']},
{'taxonomy': ['Bacteria', 'Firmicutes']},
{'taxonomy': ['Bacteria', 'Firmicutes']},
{'taxonomy': ['Bacteria', 'Firmicutes']}]
table = Table(data, observ_ids, sample_ids, observ_metadata,
sample_metadata, table_id='Example Table')
table # summary of the table
print(table) # print the table
print(table.ids()) # sample IDs
print(table.ids(axis='observation')) # observation IDs
print(table.nnz) # number of nonzero entries
I prefer the command-line mode; for interactive use in Python, more code can be found at http://biom-format.org/documentation/table_objects.html